The data for the following exercises stem from the publication Grassland ecosystem recovery after soil disturbance depends on nutrient supply rate (Seabloom, Borer, and Tilman (2020)) and are publicly available at Dryad. The data were obtained during the long-term field experiment Cedar Creek LTER and target the effects of human disturbance on grassland ecosystem functioning and biodiversity.
The following description of the data collection process is taken from Seabloom, Borer, and Tilman (2020).
The disturbance treatment was replicated in three old fields (A, B and C) in a completely randomised block design (two treatments in each of three fields, for a total of six 35 × 55 m plots). In 1982, in each of the fields, one of the two 35 × 55 m areas was selected to be disturbed with a disk harrow. Afterwards, the soil was hand-raked to smooth it and remove any remaining vegetation, so that subsequent colonisation was solely from seeds or small rhizome fragments. Within each of the six large plots, 54 small plots were arrayed in a 6 × 9 grid with 1 m buffers between plots. Aluminium flashing was buried to a depth of 30 cm around each plot to prevent horizontal movement of nutrients and spreading of plants through vegetative growth.
The nutrient treatments were replicated six times in a completely randomised design in each of the 35 × 55 m plots (54 small plots of 4 × 4 m each), yielding 324 (6 × 54) plots. The Seabloom analysis focuses on two of the nutrient treatments.
At peak biomass (mid-July to late August), all aboveground biomass was clipped in a 3 m by 10 cm strip (0.3 m2) in each plot. Note that there were 4 years in which the disturbed plots were either not sampled or sampled in only a single field. The biomass was sorted into dead, previous year’s growth (litter) and current year’s growth (live biomass). Live biomass was sorted to species, dried and weighed. We estimated total aboveground biomass as the summed biomass of all non-woody species in each 0.3 m2 sample, converted to g/m2.
Species richness is the number of species in each 0.3 m2 sample. We quantified plant diversity as the Effective Number of Species based on the Probability of Interspecific Encounter (ENSPIE), a measure of diversity that is more robust to the effects of sampling scale and less sensitive to the presence of rare species than species richness. ENSPIE is equivalent to the Inverse Simpson’s index of diversity (\(1 / \sum_{i=1}^{S} p_i^2\) where \(S\) is the total number of species and \(p_i\) is the proportion of the community biomass represented by species \(i\)).
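To make these definitions concrete, here is a minimal sketch with made-up biomass values for four hypothetical species (not taken from the dataset) showing how richness, ENSPIE and Simpson’s evenness relate:

```r
# Made-up biomass (g) of four species in one 0.3 m^2 sample
biomass <- c(50, 30, 15, 5)

p       <- biomass / sum(biomass)  # proportional biomass per species
rich    <- length(biomass)         # species richness
ens.pie <- 1 / sum(p^2)            # ENSPIE = inverse Simpson's index
even    <- ens.pie / rich          # Simpson's evenness (ENSPIE / richness)

ens.pie  # ~2.74
```

With these values, ENSPIE is about 2.74: the two rare species contribute little to the probability of interspecific encounter, so the effective number of species lies well below the observed richness of 4.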
seabloom <- read.table(here("2_Modeling/Data_preparation/seabloom-2020-ele-dryad-data/cdr-e001-e002-output-data.csv"),
sep = ",", header = TRUE)
dim(seabloom)
## [1] 5040 16
str(seabloom)
## 'data.frame': 5040 obs. of 16 variables:
## $ exp : int 1 1 1 1 1 1 1 1 1 1 ...
## $ field : chr "A" "A" "A" "A" ...
## $ plot : int 1 1 1 1 1 1 1 1 1 1 ...
## $ disk : int 0 0 0 0 0 0 0 0 0 0 ...
## $ yr.plowed : int 1968 1968 1968 1968 1968 1968 1968 1968 1968 1968 ...
## $ ntrt : int 9 9 9 9 9 9 9 9 9 9 ...
## $ nadd : num 0 0 0 0 0 0 0 0 0 0 ...
## $ other.add : int 0 0 0 0 0 0 0 0 0 0 ...
## $ year : int 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 ...
## $ dur : int 1 2 3 4 5 6 7 8 9 10 ...
## $ precip.mm : num 879 858 876 994 859 ...
## $ precip.gs : num 374 551 527 503 512 ...
## $ mass.above: num 61.3 135.9 216.9 302.8 586.9 ...
## $ rich : int 13 15 14 14 16 14 8 7 9 13 ...
## $ even : num 0.463 0.156 0.204 0.14 0.269 ...
## $ ens.pie : num 6.02 2.34 2.85 1.96 4.31 ...
Exploring the data reveals 16 variables with 5040 observations each:

- exp: treatments in split-plot design: 1 = disturbance (control or disked, 35 × 55 m plots) and 2 = nutrient addition (9 levels, 4 × 4 m plots)
- field: three experimental fields A, B and C
- plot: 54 plots within fields
- disk: disking treatment (0 = intact at start of experiment, 1 = disked at start of experiment)
- yr.plowed: last year the field was plowed for agriculture (A: 1968, B: 1957 and C: 1934)
- ntrt: nine levels representing different combinations of nitrogen (0 to 27.2 g N m−2 year−1 added as NH4NO3) and other nutrients (20 g m−2 year−1 P2O5; 20 g m−2 year−1 K2O; 40 g m−2 year−1 CaCO3; 30.0 g m−2 year−1 MgSO4; 18 μg m−2 year−1 CuSO4; 37.7 μg m−2 year−1 ZnSO4; 15.3 μg m−2 year−1 CoCO3; 322 μg m−2 year−1 MnCl2 and 15.1 μg m−2 year−1 Na2MoO4; for details see Table S1 in the publication). Nutrients were applied twice per year, in mid-May and mid-June.
- nadd: nitrogen addition rate (g/m2/yr)
- other.add: other-nutrient treatment (0 = control, 1 = other nutrients added)
- year: sampling year
- dur: duration of the experiment (years)
- precip.mm: annual precipitation (mm)
- precip.gs: growing season precipitation (mm)
- mass.above: aboveground biomass (g/m2)
- rich: species richness (species/0.3 m2)
- even: Simpson’s evenness
- ens.pie: effective number of species (based on the probability of interspecific encounter, equivalent to inverse Simpson’s diversity)
As ntrt is a coding scheme for different nutrients
added, it is crucial to treat it as such rather than as integers.
seabloom$ntrt <- as.factor(seabloom$ntrt)
str(seabloom$ntrt)
## Factor w/ 8 levels "2","3","4","5",..: 8 8 8 8 8 8 8 8 8 8 ...
Exercise: what do you think is wrong with keeping
ntrt as integers?
To simplify the analysis, we will use only the first four years of the dataset (1982 to 1985).
seabloom <- seabloom %>% filter(year %in% c(1982, 1983, 1984, 1985))
The ggpairs function (from the GGally package) yields an overview of the numerical data that we will use for the following exercises.
ggpairs(seabloom[, c(7, 11:16)], upper=NULL, lower = list(continuous = wrap("points", alpha = 0.05))) + theme_bw() + theme(panel.grid=element_blank())
A metamodel summarizes the concept behind a model and links it to theory. Here, the metamodel is visualized as a directed acyclic graph (DAG), which reads as follows: productivity (biomass) is directly influenced on the one hand by the environment (nutrients, disturbance and precipitation) and on the other hand by biodiversity (richness and evenness). In addition, some elements of the environment influence biodiversity and thus have an additional indirect effect on productivity via biodiversity.
First, implement the metamodel as a linear model (LM). For this, three models are necessary: one that accounts for the direct effects and two that account for the indirect effects.
If a linear model is visualized as a DAG, it becomes apparent that there is a set of permitted, but unanalyzed, correlations among the predictors. These correlations have a huge influence on the coefficients between the predictors \(x_{1\ldots i}\) (here: nutrients, precipitation, richness and evenness) and the response \(y\) (here: biomass). Despite their importance, it is impossible to include any information about the reason for the correlations between the predictors. Further, they make it nearly impossible to create a proper causal model, since many possible causal relations can create the same set of “unanalyzed associations”, which hinders interpretation (James B. Grace 2021).
Exercise: write the linear model in the DAG above in R syntax.
# lm.dir <- lm(mass.above ~ nadd + precip.mm + rich + even,
# data = seabloom)
# summary(lm.dir)
To account for the indirect effects, two additional LMs are necessary: one with richness and one with evenness as response.
lm.rich <- lm(rich ~ nadd + precip.mm, data = seabloom)
summary(lm.rich)
##
## Call:
## lm(formula = rich ~ nadd + precip.mm, data = seabloom)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3673 -2.3744 -0.3474 2.1960 11.2278
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.280488 1.748846 15.027 <2e-16 ***
## nadd -0.186366 0.011704 -15.923 <2e-16 ***
## precip.mm -0.016239 0.001933 -8.401 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.521 on 1149 degrees of freedom
## Multiple R-squared: 0.22, Adjusted R-squared: 0.2187
## F-statistic: 162.1 on 2 and 1149 DF, p-value: < 2.2e-16
lm.even <- lm(even ~ nadd + precip.mm, data = seabloom)
summary(lm.even)
##
## Call:
## lm(formula = even ~ nadd + precip.mm, data = seabloom)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27070 -0.09337 -0.02885 0.06535 0.70936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0686810 0.0688456 0.998 0.3187
## nadd 0.0043439 0.0004607 9.428 <2e-16 ***
## precip.mm 0.0001863 0.0000761 2.448 0.0145 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1386 on 1149 degrees of freedom
## Multiple R-squared: 0.07628, Adjusted R-squared: 0.07467
## F-statistic: 47.44 on 2 and 1149 DF, p-value: < 2.2e-16
The direct effect model showed that biomass was statistically
significantly positively related to nutrient input (the nitrogen addition
rate nadd) and species richness (rich), but
negatively to evenness (even). Additionally, nutrient input
had a statistically significant negative influence on species richness
but a positive effect on evenness.
Structural equations aspire to represent cause-effect relationships and thus, they can be used to represent scientific, causal hypotheses. A key feature of structural equation models is the ability to investigate the networks of connections among system components (James B. Grace et al. 2012).
To evaluate the SEMs, we will use the package lavaan, which relies on
the computation of covariance matrices to fit the structural equations
(an approach known as ‘global estimation’). It comes with full
support for categorical data (any mixture of binary, ordinal and
continuous observed variables), can handle both latent and composite
variables, performs mediation analysis and calculates the magnitude of
indirect effects (Rosseel 2012, 2021).
library("lavaan")
## This is lavaan 0.6-16
## lavaan is FREE software! Please report any bugs.
Inspecting our set of variables shows that the correlations between all variables are rather weak, except that between annual precipitation and growing-season precipitation.
ggpairs(seabloom[, c(7, 11:16)], lower = list(continuous = wrap("points", alpha = 0.05))) + theme_bw() + theme(panel.grid=element_blank())
Variables that are highly correlated (\(r > 0.85\)) become redundant. Then, one of them can be dropped, or they can be summarized as a latent variable (James B. Grace 2006).
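For illustration: if annual and growing-season precipitation were too strongly correlated to enter a model separately, they could be summarized as a latent variable. This is only a hypothetical sketch in lavaan syntax, not a model we fit below:

```r
# Hypothetical sketch: treat the two precipitation measures as reflective
# indicators of a latent variable 'precip' and use that latent variable
# as a predictor instead of the two observed ones
latent.sketch <-
  "precip =~ precip.mm + precip.gs
   mass.above ~ nadd + precip"
```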
This toy model shall illustrate the logic of SEM. It contains the same variables and evaluates the same metamodel as the LM before, to ease comparability between the two methods. A huge benefit of SEM is that variables can appear as both predictors and responses, which allows the evaluation of direct and indirect effects in one go (as shown in the directed acyclic graph (DAG) below). In SEM jargon, predictors (nodes that have no arrows pointing at them) are called exogenous variables, while responses (nodes that have arrows pointing at them) are called endogenous variables.
The table below summarizes the operators of lavaan
syntax. For example, an arrow in a DAG is represented by a tilde.
| Formula type | Operator | Meaning | Example |
|---|---|---|---|
| regression | ~ | is regressed on | y ~ x |
| correlation | ~~ | correlate errors for | y1 ~~ y2 |
| latent variable | =~ | set reflective indicators | Height =~ y1 + y2 + y3 |
| composite variable | <~ | set formative indicators | Comp1 <~ 1*x1 + x2 + x3 |
| mean/intercept | ~ 1 | estimate mean for y without xs | y ~ 1 |
| label parameter | * | name coefficients | y ~ b1*x1 + b2*x2 |
| define quantity | := | define quantity | TotalEffect := b1*b3 + b2 |
| define thresholds | \| | for ordered categorical variable | u \| t1 + t2 |
Exercise: translate the model in the DAG above into
lavaan syntax with the help of this table.
simple <-
"mass.above ~ nadd + rich + even + precip.mm + disk
rich ~ nadd + precip.mm
even ~ nadd + precip.mm
"
lavaan uses a \(\chi^2\) test that compares the estimated
to the observed covariance matrix to compute the goodness of fit of the
SE model, under the assumption that all observations are independent and
all variables follow a (multivariate) normal distribution (James B. Grace
2006). Note that these distributional assumptions only apply
to endogenous variables; the distribution of exogenous variables
has no bearing on them (James B. Grace 2021).
We use a graphical method to assess the fit of the endogenous variables to a normal distribution: the quantile-quantile plot (Q-Q plot). Here, the quantiles of the data are compared to those of a theoretical distribution (i.e., the normal distribution). If the data were normally distributed, the points would match one to one and thus align along the diagonal. In the Q-Q plot, this match is indicated by the black line.
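The principle can be seen with simulated data: for normally distributed values the points follow the diagonal drawn by qqline, while skewed values bend away from it. A minimal base-R sketch:

```r
set.seed(1)
x.norm <- rnorm(500)  # normally distributed data
x.skew <- rexp(500)   # right-skewed data

par(mfrow = c(1, 2))
qqnorm(x.norm, main = "Normal data"); qqline(x.norm)  # points hug the line
qqnorm(x.skew, main = "Skewed data"); qqline(x.skew)  # points bend away
par(mfrow = c(1, 1))
```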
The package MVN
allows plotting several Q-Q plots at once and also offers several tests
of multivariate normality (MVN). Here, we employ the Henze-Zirkler test,
which has been recommended as a formal test of MVN (Mecklin and Mundfrom
2005).
library("MVN")
mvn(data = seabloom[, c(14:16)], mvnTest = "hz", univariatePlot = "qqplot")
## $multivariateNormality
## Test HZ p value MVN
## 1 Henze-Zirkler 49.66399 0 NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Anderson-Darling rich 3.4469 <0.001 NO
## 2 Anderson-Darling even 34.4476 <0.001 NO
## 3 Anderson-Darling ens.pie 21.1284 <0.001 NO
##
## $Descriptives
## n Mean Std.Dev Median Min Max 25th
## rich 1152 10.111111 3.9834342 10.0000000 1.00000000 22.000000 7.0000000
## even 1152 0.272226 0.1440961 0.2401771 0.08043888 1.000000 0.1722154
## ens.pie 1152 2.508977 1.2426498 2.2784753 1.00000000 8.024094 1.4436818
## 75th Skew Kurtosis
## rich 13.0000000 0.1068243 -0.2346526
## even 0.3366675 2.2169874 7.8563731
## ens.pie 3.2892034 0.8212620 0.1845651
Alternatively, histograms overlaid with a normal density curve of the same mean and standard deviation as the data also give insight into the distribution. First, let’s define a function that plots the histogram and the expected density curve (it expects a numerical vector and a character string as input):
histWithDensity <- function(variable, name){
hist(variable, prob = TRUE, main = "", xlab = name)
x <- seq(min(variable), max(variable), length = 400)
y <- dnorm(x, mean = mean(variable), sd = sd(variable))
lines(x, y, col = "red", lwd = 2)
}
Then, we can apply this function to our three endogenous variables to plot the histograms with the expected density curves under a normal distribution:
par(mfrow = c(1, 3))
histWithDensity(seabloom$mass.above, "mass.above")
histWithDensity(seabloom$rich, "rich")
histWithDensity(seabloom$even, "even")
Exercise: would you infer that the endogenous variables meet the assumption of being (multivariate) normally distributed from the results of the the Q-Q plots, the Henze-Zirkler test and the histograms? And why (not)?
A square-root transformation of the three endogenous variables shall make them more “normal” (note that, despite the log. prefix of the new variable names, it is the square root that is applied).
seabloom$log.mass.above <- sqrt(seabloom$mass.above)
seabloom$log.even <- sqrt(seabloom$even)
seabloom$log.rich <- sqrt(seabloom$rich)
mvn(data = seabloom[, c(17:19)], mvnTest = "hz", univariatePlot = "qqplot")
## $multivariateNormality
## Test HZ p value MVN
## 1 Henze-Zirkler 6.094983 0 NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Anderson-Darling log.mass.above 5.8462 <0.001 NO
## 2 Anderson-Darling log.even 10.4552 <0.001 NO
## 3 Anderson-Darling log.rich 7.7752 <0.001 NO
##
## $Descriptives
## n Mean Std.Dev Median Min Max
## log.mass.above 1152 17.3191683 4.5270842 16.7137186 6.0975405 37.166288
## log.even 1152 0.5070707 0.1229573 0.4900786 0.2836175 1.000000
## log.rich 1152 3.1076865 0.6736393 3.1622777 1.0000000 4.690416
## 25th 75th Skew Kurtosis
## log.mass.above 14.1172147 20.0348746 0.7156633 1.0374606
## log.even 0.4149884 0.5802305 1.1185004 2.3418906
## log.rich 2.6457513 3.6055513 -0.5689120 0.4299261
par(mfrow = c(1, 3))
histWithDensity(seabloom$log.mass.above, "sqrt(mass.above)")
histWithDensity(seabloom$log.rich, "sqrt(rich)")
histWithDensity(seabloom$log.even, "sqrt(even)")
par(mfrow = c(1, 1))
Now, let’s fit the model with lavaan’s sem
function. As the data clearly deviate from MVN, we use the
MLM estimator, which provides standard errors and a \(\chi^2\) test statistic that are robust to
non-normality. Here, the Satorra-Bentler correction adjusts
the value of the ML-based \(\chi^2\)
test statistic by an amount that reflects the degree of kurtosis (Rosseel 2012).1
fit.simple <- sem(simple, data = seabloom, estimator = "MLM")
## Warning in lav_data_full(data = data, group = group, cluster = cluster, :
## lavaan WARNING: some observed variances are (at least) a factor 1000 times
## larger than others; use varTable(fit) to investigate
Oops, the algorithm converges with a warning. Helpfully, it tells us how to fix this.
Exercise: let’s obey the software and execute the code from the hint.
# varTable(fit.simple)
This reveals an enormous difference in magnitude between the variance
of biomass (mass.above) and the other variables.
To remove this difference in magnitude between the variables, we
divide mass.above and precip.mm by 100, which
changes the units from g/m2 to 100 g/m2 and from mm
to 0.1 m (i.e., 100 mm), respectively. The boxplots show that the ranges of the
variables are now more similar.
seabloom$mass.above <- seabloom$mass.above / 100
seabloom$precip.mm <- seabloom$precip.mm / 100
boxplot(seabloom[, c(7, 11, 13:16)], las = 2)
Then, run sem again with the rescaled
biomass and precip.mm variables:
fit.simple <- sem(simple, data = seabloom, estimator = "MLM")
summary(fit.simple, fit.measures = TRUE)
## lavaan 0.6.16 ended normally after 2 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 12
##
## Number of observations 1152
##
## Model Test User Model:
## Standard Scaled
## Test Statistic 156.654 129.470
## Degrees of freedom 3 3
## P-value (Chi-square) 0.000 0.000
## Scaling correction factor 1.210
## Satorra-Bentler correction
##
## Model Test Baseline Model:
##
## Test statistic 1005.076 953.219
## Degrees of freedom 12 12
## P-value 0.000 0.000
## Scaling correction factor 1.054
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.845 0.866
## Tucker-Lewis Index (TLI) 0.381 0.463
##
## Robust Comparative Fit Index (CFI) 0.846
## Robust Tucker-Lewis Index (TLI) 0.383
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -4480.406 -4480.406
## Loglikelihood unrestricted model (H1) -4402.079 -4402.079
##
## Akaike (AIC) 8984.813 8984.813
## Bayesian (BIC) 9045.404 9045.404
## Sample-size adjusted Bayesian (SABIC) 9007.288 9007.288
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.211 0.191
## 90 Percent confidence interval - lower 0.183 0.166
## 90 Percent confidence interval - upper 0.240 0.217
## P-value H_0: RMSEA <= 0.050 0.000 0.000
## P-value H_0: RMSEA >= 0.080 1.000 1.000
##
## Robust RMSEA 0.210
## 90 Percent confidence interval - lower 0.180
## 90 Percent confidence interval - upper 0.242
## P-value H_0: Robust RMSEA <= 0.050 0.000
## P-value H_0: Robust RMSEA >= 0.080 1.000
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.067 0.067
##
## Parameter Estimates:
##
## Standard errors Robust.sem
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## mass.above ~
## nadd 0.111 0.007 15.119 0.000
## rich 0.049 0.011 4.505 0.000
## even -1.667 0.299 -5.567 0.000
## precip.mm -0.039 0.083 -0.475 0.635
## disk 0.941 0.084 11.222 0.000
## rich ~
## nadd -0.186 0.012 -15.865 0.000
## precip.mm -1.624 0.206 -7.870 0.000
## even ~
## nadd 0.004 0.001 7.116 0.000
## precip.mm 0.019 0.008 2.207 0.027
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .mass.above 2.023 0.149 13.573 0.000
## .rich 12.366 0.504 24.527 0.000
## .even 0.019 0.002 11.928 0.000
This time, the model converged without a warning; the fit, however, is poor (RMSEA = 0.21, robust CFI = 0.85).
To improve the model fit, we look for missing paths via the modification indices. They estimate the drop in the model \(\chi^2\) that would result from freeing a fixed parameter, i.e. from including a missing path. A value of 3.84 is considered the critical threshold, the “single-degree-of-freedom \(\chi^2\) criterion” (James B. Grace 2021).
modindices(fit.simple, minimum.value = 3.84)
## lhs op rhs mi epc sepc.lv sepc.all sepc.nox
## 21 rich ~~ even 142.511 -0.171 -0.171 -0.352 -0.352
## 22 rich ~ mass.above 16.925 0.814 0.814 0.356 0.356
## 23 rich ~ even 142.511 -8.935 -8.935 -0.323 -0.323
## 25 even ~ mass.above 4.739 -0.018 -0.018 -0.215 -0.215
## 26 even ~ rich 142.511 -0.014 -0.014 -0.383 -0.383
## 27 even ~ disk 4.164 0.017 0.017 0.058 0.116
## 38 disk ~ mass.above 11.642 -0.249 -0.249 -0.867 -0.867
In the column mi (for modification index) we look for
high values. Note, however, that the modification indices are uninformed
suggestions; any adaptation of the model based on them needs to be
backed by theory.
In this example, the modification indices indicate, among others, a missing relation between richness and evenness. Including this relation in the model is expected to improve its fit by a drop in the \(\chi^2\) of 142.511.
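The 3.84 criterion mentioned above is nothing mysterious: it is the 95% quantile of the \(\chi^2\) distribution with one degree of freedom, which can be checked directly:

```r
# 95% quantile of the chi-squared distribution with df = 1
qchisq(0.95, df = 1)
## [1] 3.841459
```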
This path is justified because richness and evenness are computationally
related to each other (they are not independent quantities). Thus, let’s
include a correlation between rich and even
(note: in SEM, a correlation between two variables points to an
omitted common cause/variable that drives this correlation). With the
function update() it is possible to incorporate
the missing path directly into the specified model without rewriting it from
scratch:
fit.simple.up <- update(fit.simple, add = "rich ~~ even")
summary(fit.simple.up, fit.measures = TRUE, rsq = TRUE)
## lavaan 0.6.16 ended normally after 28 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 13
##
## Number of observations 1152
##
## Model Test User Model:
## Standard Scaled
## Test Statistic 4.526 4.535
## Degrees of freedom 2 2
## P-value (Chi-square) 0.104 0.104
## Scaling correction factor 0.998
## Satorra-Bentler correction
##
## Model Test Baseline Model:
##
## Test statistic 1005.076 953.219
## Degrees of freedom 12 12
## P-value 0.000 0.000
## Scaling correction factor 1.054
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.997 0.997
## Tucker-Lewis Index (TLI) 0.985 0.984
##
## Robust Comparative Fit Index (CFI) 0.997
## Robust Tucker-Lewis Index (TLI) 0.985
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -4404.342 -4404.342
## Loglikelihood unrestricted model (H1) -4402.079 -4402.079
##
## Akaike (AIC) 8834.685 8834.685
## Bayesian (BIC) 8900.325 8900.325
## Sample-size adjusted Bayesian (SABIC) 8859.033 8859.033
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.033 0.033
## 90 Percent confidence interval - lower 0.000 0.000
## 90 Percent confidence interval - upper 0.075 0.075
## P-value H_0: RMSEA <= 0.050 0.692 0.691
## P-value H_0: RMSEA >= 0.080 0.029 0.030
##
## Robust RMSEA 0.033
## 90 Percent confidence interval - lower 0.000
## 90 Percent confidence interval - upper 0.075
## P-value H_0: Robust RMSEA <= 0.050 0.692
## P-value H_0: Robust RMSEA >= 0.080 0.029
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.015 0.015
##
## Parameter Estimates:
##
## Standard errors Robust.sem
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## mass.above ~
## nadd 0.111 0.007 15.964 0.000
## rich 0.049 0.011 4.359 0.000
## even -1.667 0.311 -5.358 0.000
## precip.mm -0.039 0.081 -0.483 0.629
## disk 0.941 0.084 11.222 0.000
## rich ~
## nadd -0.186 0.012 -15.865 0.000
## precip.mm -1.624 0.206 -7.870 0.000
## even ~
## nadd 0.004 0.001 7.116 0.000
## precip.mm 0.019 0.008 2.207 0.027
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## .rich ~~
## .even -0.171 0.018 -9.336 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .mass.above 2.023 0.149 13.573 0.000
## .rich 12.366 0.504 24.527 0.000
## .even 0.019 0.002 11.928 0.000
##
## R-Square:
## Estimate
## mass.above 0.340
## rich 0.220
## even 0.076
Now, the model has a decent fit: the \(\chi^2\) test is non-significant (\(p\) = 0.10), the ratio of test statistic to degrees of freedom is 2.26, and the robust CFI is close to 1 (0.997).
Another look at the modification indices shows that only one modification would yield a notable improvement, but based on the experimental design AGB cannot influence the disk treatment, which was randomly assigned. All other modifications would yield only marginal improvements to the model fit, so we can confidently ignore them:
modindices(fit.simple.up, minimum.value = 0.01)
## lhs op rhs mi epc sepc.lv sepc.all sepc.nox
## 22 rich ~ mass.above 0.352 -0.122 -0.122 -0.054 -0.054
## 24 rich ~ disk 0.352 -0.115 -0.115 -0.014 -0.029
## 25 even ~ mass.above 2.895 0.014 0.014 0.168 0.168
## 27 even ~ disk 2.895 0.013 0.013 0.045 0.090
## 38 disk ~ mass.above 17.272 -0.369 -0.369 -1.292 -1.292
## 39 disk ~ rich 0.985 -0.003 -0.003 -0.026 -0.026
## 40 disk ~ even 3.505 0.183 0.183 0.053 0.053
To evaluate whether the simple or the updated model performs better,
we can calculate the Akaike
information criterion (AIC) and compute a significance test based on
the \(\chi^2\) test statistic with the
anova function.
anova(fit.simple, fit.simple.up)
##
## Scaled Chi-Squared Difference Test (method = "satorra.bentler.2001")
##
## lavaan NOTE:
## The "Chisq" column contains standard test statistics, not the
## robust test that should be reported per model. A robust difference
## test is a function of two standard (not robust) statistics.
##
## Df AIC BIC Chisq Chisq diff Df diff Pr(>Chisq)
## fit.simple.up 2 8834.7 8900.3 4.5258
## fit.simple 3 8984.8 9045.4 156.6537 93.113 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table we can see that the difference in AIC (150.13)
clearly favors the updated simple.up model
(note: a difference in AIC of two is already considered
informative (Burnham and Anderson 2002)).
Further, the difference in the \(\chi^2\) is 93.11, far beyond the threshold of 3.84 required for a statistically significant difference at a significance level of \(\alpha = 0.05\).
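Both numbers can be recomputed by hand from the ANOVA table above:

```r
# Difference in AIC between the two models
8984.813 - 8834.685
## [1] 150.128

# p-value of the chi-squared difference (93.113) on one degree of freedom
pchisq(93.113, df = 1, lower.tail = FALSE)  # far below 0.05
```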
The analysis based on covariances allows estimating both unstandardized (raw) and standardized coefficients: unstandardized coefficients are estimated from the covariances, standardized coefficients from the correlations. The computational relation between correlations and covariances is:
\(r_{xy} = \frac{cov_{xy}}{SD_x \times SD_y}\)
… where \(r_{xy}\) is the correlation, \(cov_{xy}\) the covariance, and \(SD_x\) and \(SD_y\) the respective standard deviations (the square roots of the respective variances, \(SD_x = \sqrt{var_x}\) and \(SD_y = \sqrt{var_y}\)) (James B. Grace 2021).
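This relation is easy to verify numerically with simulated data:

```r
set.seed(42)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

# correlation computed manually from the covariance and standard deviations
r.manual <- cov(x, y) / (sd(x) * sd(y))
all.equal(r.manual, cor(x, y))
## [1] TRUE
```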
Significance testing of standardized coefficients violates statistical principles. Significance tests are hence based on unstandardized (raw) coefficients and standardized coefficients are used for interpretation only (James B. Grace 2021).
For raw coefficients, the predicted effects are in raw units and have a straightforward interpretation: in our example, e.g., the predicted change in aboveground biomass (in units of 100 g/m2 after the rescaling above) associated with a certain change in added nutrients. For the standardized coefficients, the interpretation is more complex: they express the predicted changes in standard deviation units. Thus, they are only interpretable within a sample, which hinders generalization. Still, they ease comparison, as they are in the same units across the various pathways (James B. Grace 2021).
There are different standardizations available in
lavaan, whereof std.all is based on the
standard deviation:
standardizedsolution(fit.simple.up, type = "std.all")
## lhs op rhs est.std se z pvalue ci.lower ci.upper
## 1 mass.above ~ nadd 0.561 0.026 21.956 0.000 0.511 0.612
## 2 mass.above ~ rich 0.111 0.026 4.295 0.000 0.060 0.162
## 3 mass.above ~ even -0.137 0.024 -5.748 0.000 -0.184 -0.090
## 4 mass.above ~ precip.mm -0.012 0.025 -0.484 0.628 -0.061 0.037
## 5 mass.above ~ disk 0.269 0.022 12.333 0.000 0.226 0.311
## 6 rich ~ nadd -0.415 0.023 -18.342 0.000 -0.459 -0.371
## 7 rich ~ precip.mm -0.219 0.028 -7.876 0.000 -0.273 -0.164
## 8 even ~ nadd 0.267 0.030 8.829 0.000 0.208 0.327
## 9 even ~ precip.mm 0.069 0.030 2.285 0.022 0.010 0.129
## 10 rich ~~ even -0.352 0.024 -14.553 0.000 -0.399 -0.304
## 11 mass.above ~~ mass.above 0.660 0.023 28.382 0.000 0.615 0.706
## 12 rich ~~ rich 0.780 0.020 38.181 0.000 0.740 0.820
## 13 even ~~ even 0.924 0.017 54.116 0.000 0.890 0.957
## 14 nadd ~~ nadd 1.000 0.000 NA NA 1.000 1.000
## 15 nadd ~~ precip.mm 0.000 0.000 NA NA 0.000 0.000
## 16 nadd ~~ disk 0.000 0.000 NA NA 0.000 0.000
## 17 precip.mm ~~ precip.mm 1.000 0.000 NA NA 1.000 1.000
## 18 precip.mm ~~ disk 0.000 0.000 NA NA 0.000 0.000
## 19 disk ~~ disk 1.000 0.000 NA NA 1.000 1.000
To estimate derived quantities and their standard errors, we can use
the := operator in lavaan. The code below derives
the direct, indirect and total effects of nutrients on aboveground
biomass by multiplying coefficients along compound paths and summing
the direct and indirect effects.
derived <-
"mass.above ~ b1 * nadd + b2 * rich + b3 * even + precip.mm + disk
rich ~ b4 * nadd + precip.mm
even ~ b5 * nadd + precip.mm
dir.nut.effect := b1
indir.nut.effect := b2 * b4 + b3 * b5
tot.nut.effect := b1 + b2 * b4 + b3 * b5
"
fit.derived <- sem(derived, data = seabloom, estimator = "MLM")
summary(fit.derived, rsq = TRUE)
## lavaan 0.6.16 ended normally after 2 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 12
##
## Number of observations 1152
##
## Model Test User Model:
## Standard Scaled
## Test Statistic 156.654 129.470
## Degrees of freedom 3 3
## P-value (Chi-square) 0.000 0.000
## Scaling correction factor 1.210
## Satorra-Bentler correction
##
## Parameter Estimates:
##
## Standard errors Robust.sem
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## mass.above ~
## nadd (b1) 0.111 0.007 15.119 0.000
## rich (b2) 0.049 0.011 4.505 0.000
## even (b3) -1.667 0.299 -5.567 0.000
## precip.mm -0.039 0.083 -0.475 0.635
## disk 0.941 0.084 11.222 0.000
## rich ~
## nadd (b4) -0.186 0.012 -15.865 0.000
## precip.mm -1.624 0.206 -7.870 0.000
## even ~
## nadd (b5) 0.004 0.001 7.116 0.000
## precip.mm 0.019 0.008 2.207 0.027
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .mass.above 2.023 0.149 13.573 0.000
## .rich 12.366 0.504 24.527 0.000
## .even 0.019 0.002 11.928 0.000
##
## R-Square:
## Estimate
## mass.above 0.334
## rich 0.220
## even 0.076
##
## Defined Parameters:
## Estimate Std.Err z-value P(>|z|)
## dir.nut.effect 0.111 0.007 15.119 0.000
## indir.nut.ffct -0.016 0.003 -5.262 0.000
## tot.nut.effect 0.095 0.006 14.769 0.000
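As a sanity check, the defined parameters can be recomputed by hand from the path coefficients in the output above; they are nothing more than products and sums of the labeled paths:

```r
b1 <- 0.111   # nadd -> mass.above (direct path)
b2 <- 0.049   # rich -> mass.above
b3 <- -1.667  # even -> mass.above
b4 <- -0.186  # nadd -> rich
b5 <- 0.004   # nadd -> even

b2 * b4 + b3 * b5       # indirect effect: ~ -0.016
b1 + b2 * b4 + b3 * b5  # total effect:    ~  0.095
```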
In a saturated model, all possible paths are specified and, as a result, there are no degrees of freedom left (James B. Grace et al. 2010). Saturated models represent a special class because they allow everything to add up: the observed matrix of covariances can be recovered completely. Unsaturated models, in contrast, have testable implications. Under global estimation, the saturated model is the reference for calculating the GOF: because it permits every covariance to be explained, its fit function value \(F_{ML}\) goes to zero and thus \(\chi^2 = 0\). In comparison, an unsaturated model will have a positive \(F_{ML}\) (James B. Grace 2021).
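The degrees of freedom follow from simple counting: with \(p\) observed variables there are \(p(p+1)/2\) unique (co)variances to explain; lavaan (with its default fixed.x = TRUE) does not count the (co)variances among the exogenous variables, and each free parameter uses up one piece of information. A sketch of this bookkeeping for the simple model above:

```r
p <- 6  # observed variables: mass.above, rich, even, nadd, precip.mm, disk
q <- 3  # exogenous variables: nadd, precip.mm, disk

moments <- p * (p + 1) / 2 - q * (q + 1) / 2  # 21 - 6 = 15 sample moments
params  <- 12                                 # free parameters (see summary)
moments - params                              # 3 degrees of freedom, as reported
```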
Adding a path from disk to each of richness
(rich) and evenness (even) turns our simple
model into a saturated one.
satur <-
"mass.above ~ nadd + rich + even + precip.mm + disk
rich ~ nadd + precip.mm + disk
even ~ nadd + precip.mm + disk
rich ~~ even"
fit.satur <- sem(satur, data = seabloom, estimator = "MLM")
summary(fit.satur, rsq = TRUE)
## lavaan 0.6.16 ended normally after 31 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 15
##
## Number of observations 1152
##
## Model Test User Model:
## Standard Scaled
## Test Statistic 0.000 0.000
## Degrees of freedom 0 0
##
## Parameter Estimates:
##
## Standard errors Robust.sem
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## mass.above ~
## nadd 0.111 0.007 15.926 0.000
## rich 0.049 0.011 4.351 0.000
## even -1.667 0.313 -5.318 0.000
## precip.mm -0.039 0.081 -0.483 0.629
## disk 0.941 0.085 11.087 0.000
## rich ~
## nadd -0.186 0.012 -15.865 0.000
## precip.mm -1.624 0.206 -7.870 0.000
## disk -0.264 0.207 -1.274 0.203
## even ~
## nadd 0.004 0.001 7.116 0.000
## precip.mm 0.019 0.008 2.207 0.027
## disk 0.017 0.008 2.044 0.041
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## .rich ~~
## .even -0.170 0.018 -9.375 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .mass.above 2.023 0.149 13.573 0.000
## .rich 12.348 0.509 24.274 0.000
## .even 0.019 0.002 12.024 0.000
##
## R-Square:
## Estimate
## mass.above 0.335
## rich 0.221
## even 0.080
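The zero degrees of freedom can be verified with a back-of-envelope count. Conditioning on the three exogenous variables (nadd, precip.mm, disk), lavaan fits 15 sample moments, which equals the number of free parameters in the saturated model. This is a sketch of the counting, not lavaan's internal bookkeeping:

```r
n.endo <- 3  # mass.above, rich, even
n.exo  <- 3  # nadd, precip.mm, disk

# Cross-covariances of endogenous with exogenous variables, plus the
# unique (co)variances among the endogenous variables:
moments <- n.exo * n.endo + n.endo * (n.endo + 1) / 2  # 15

moments - 15  # saturated model: 15 free parameters -> df = 0
moments - 13  # pruned model:    13 free parameters -> df = 2
```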
Exercise: What modification indices do you expect from a saturated model?
# modindices(fit.satur, minimum.value = 0)
Now, let’s delete all statistically non-significant paths from the saturated model to obtain the most parsimonious model. There are two such paths: one from disturbance (disk) to richness and one from precipitation to aboveground biomass.
prune <-
"mass.above ~ nadd + rich + even + disk
rich ~ nadd + precip.mm
even ~ nadd + precip.mm + disk
rich ~~ even"
fit.prune <- sem(prune, data = seabloom, estimator = "MLM")
summary(fit.prune, rsq = TRUE, fit.measures = TRUE)
## lavaan 0.6.16 ended normally after 27 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 13
##
## Number of observations 1152
##
## Model Test User Model:
## Standard Scaled
## Test Statistic 1.861 1.844
## Degrees of freedom 2 2
## P-value (Chi-square) 0.394 0.398
## Scaling correction factor 1.009
## Satorra-Bentler correction
##
## Model Test Baseline Model:
##
## Test statistic 1005.076 953.219
## Degrees of freedom 12 12
## P-value 0.000 0.000
## Scaling correction factor 1.054
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 1.000 1.000
## Tucker-Lewis Index (TLI) 1.001 1.001
##
## Robust Comparative Fit Index (CFI) 1.000
## Robust Tucker-Lewis Index (TLI) 1.001
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -4403.010 -4403.010
## Loglikelihood unrestricted model (H1) -4402.079 -4402.079
##
## Akaike (AIC) 8832.020 8832.020
## Bayesian (BIC) 8897.661 8897.661
## Sample-size adjusted Bayesian (SABIC) 8856.368 8856.368
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.000 0.000
## 90 Percent confidence interval - lower 0.000 0.000
## 90 Percent confidence interval - upper 0.057 0.057
## P-value H_0: RMSEA <= 0.050 0.908 0.910
## P-value H_0: RMSEA >= 0.080 0.004 0.003
##
## Robust RMSEA 0.000
## 90 Percent confidence interval - lower 0.000
## 90 Percent confidence interval - upper 0.057
## P-value H_0: Robust RMSEA <= 0.050 0.907
## P-value H_0: Robust RMSEA >= 0.080 0.004
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.008 0.008
##
## Parameter Estimates:
##
## Standard errors Robust.sem
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## mass.above ~
## nadd 0.111 0.007 15.891 0.000
## rich 0.050 0.011 4.693 0.000
## even -1.664 0.315 -5.290 0.000
## disk 0.941 0.084 11.149 0.000
## rich ~
## nadd -0.186 0.012 -15.865 0.000
## precip.mm -1.624 0.206 -7.870 0.000
## even ~
## nadd 0.004 0.001 7.116 0.000
## precip.mm 0.019 0.008 2.207 0.027
## disk 0.013 0.008 1.706 0.088
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## .rich ~~
## .even -0.170 0.018 -9.319 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .mass.above 2.024 0.149 13.565 0.000
## .rich 12.366 0.504 24.527 0.000
## .even 0.019 0.002 11.997 0.000
##
## R-Square:
## Estimate
## mass.above 0.337
## rich 0.220
## even 0.078
aictab(list(fit.satur, fit.prune),
c("saturated", "pruned"))
##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## pruned 13 8832.34 0.00 0.75 0.75 -4403.01
## saturated 15 8834.58 2.24 0.25 1.00 -4402.08
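The AICc values reported by aictab can be reproduced by hand from the log-likelihood: \(AIC = -2LL + 2k\), with the small-sample correction \(2k(k+1)/(n-k-1)\) added for the AICc. For the pruned model:

```r
ll <- -4403.010  # log-likelihood of the pruned model (from the summary)
k  <- 13         # number of free parameters
n  <- 1152       # number of observations

aic  <- -2 * ll + 2 * k                      # 8832.02, as in the summary
aicc <- aic + 2 * k * (k + 1) / (n - k - 1)  # 8832.34, as in the aictab output
```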
Comparing the CFI and AICc, we see that the pruned model is slightly superior to the saturated model: its AICc is the lowest (with ΔAICc > 2 to the saturated model) and it has fewer parameters, making it the most parsimonious model.
Now that we have our final model, let’s take a closer look at it via summary, this time also requesting the standardized estimates:
summary(fit.prune, fit.measures = TRUE, rsq = TRUE, standardized = TRUE)
## lavaan 0.6.16 ended normally after 27 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 13
##
## Number of observations 1152
##
## Model Test User Model:
## Standard Scaled
## Test Statistic 1.861 1.844
## Degrees of freedom 2 2
## P-value (Chi-square) 0.394 0.398
## Scaling correction factor 1.009
## Satorra-Bentler correction
##
## Model Test Baseline Model:
##
## Test statistic 1005.076 953.219
## Degrees of freedom 12 12
## P-value 0.000 0.000
## Scaling correction factor 1.054
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 1.000 1.000
## Tucker-Lewis Index (TLI) 1.001 1.001
##
## Robust Comparative Fit Index (CFI) 1.000
## Robust Tucker-Lewis Index (TLI) 1.001
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -4403.010 -4403.010
## Loglikelihood unrestricted model (H1) -4402.079 -4402.079
##
## Akaike (AIC) 8832.020 8832.020
## Bayesian (BIC) 8897.661 8897.661
## Sample-size adjusted Bayesian (SABIC) 8856.368 8856.368
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.000 0.000
## 90 Percent confidence interval - lower 0.000 0.000
## 90 Percent confidence interval - upper 0.057 0.057
## P-value H_0: RMSEA <= 0.050 0.908 0.910
## P-value H_0: RMSEA >= 0.080 0.004 0.003
##
## Robust RMSEA 0.000
## 90 Percent confidence interval - lower 0.000
## 90 Percent confidence interval - upper 0.057
## P-value H_0: Robust RMSEA <= 0.050 0.907
## P-value H_0: Robust RMSEA >= 0.080 0.004
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.008 0.008
##
## Parameter Estimates:
##
## Standard errors Robust.sem
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## mass.above ~
## nadd 0.111 0.007 15.891 0.000 0.111 0.564
## rich 0.050 0.011 4.693 0.000 0.050 0.115
## even -1.664 0.315 -5.290 0.000 -1.664 -0.137
## disk 0.941 0.084 11.149 0.000 0.941 0.269
## rich ~
## nadd -0.186 0.012 -15.865 0.000 -0.186 -0.415
## precip.mm -1.624 0.206 -7.870 0.000 -1.624 -0.219
## even ~
## nadd 0.004 0.001 7.116 0.000 0.004 0.267
## precip.mm 0.019 0.008 2.207 0.027 0.019 0.069
## disk 0.013 0.008 1.706 0.088 0.013 0.045
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .rich ~~
## .even -0.170 0.018 -9.319 0.000 -0.170 -0.351
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .mass.above 2.024 0.149 13.565 0.000 2.024 0.663
## .rich 12.366 0.504 24.527 0.000 12.366 0.780
## .even 0.019 0.002 11.997 0.000 0.019 0.922
##
## R-Square:
## Estimate
## mass.above 0.337
## rich 0.220
## even 0.078
Exercise: how would you interpret the meaning of the \(p\)-values of the single paths?
The meaning of the summary output is (James B. Grace 2021):
- Degrees of freedom: the number of paths omitted from the model; these omissions provide the capacity to test the architecture of the model
- P-value: the probability of the data given our model
- Std.Err: the standard error of the estimate
- z-value: the analogue to \(t\)-values, derived from maximum likelihood estimation (ML)
- P(>|z|): the \(p\)-value, the probability of obtaining a \(z\) of the given magnitude by chance
- Regressions: the path coefficients; Estimate gives the raw, unstandardized coefficients
- Variances: estimates of the error variances of the endogenous variables
- R-Square: the variance explained for the endogenous variables, i.e. \(1 - \text{error variance}\) in standardized terms. Paths from error variables represent influences from un-modeled factors.
The inspect function retrieves information from a lavaan object (for more options see the help):
inspect(fit.prune, what = "r2")
## mass.above rich even
## 0.337 0.220 0.078
The lavaanPlot package provides a simple, straightforward way to visualize diagrams from lavaan objects. Another package that handles lavaan objects is semPlot.
if (!requireNamespace("lavaanPlot", quietly = TRUE)) {
  install.packages("lavaanPlot")
}
Then, we can plot the results with significance levels displayed as asterisks.
library("lavaanPlot")
lavaanPlot(model = fit.prune,
node_options = list(shape = "box", color = "gray",
fontname = "Helvetica"),
edge_options = list(color = "black"),
coefs = TRUE, covs = FALSE, stars = "regress")
For more customization, check out the lavaanPlot
documentation.
library("semPlot")
# Add p-values and R^2: https://stackoverflow.com/questions/60706206/how-do-i-include-p-value-and-r-square-for-the-estimates-in-sempaths
semPaths(fit.prune, what = "est", whatLabels = "est",
residuals = FALSE, intercepts = FALSE,
sizeMan = 10, sizeMan2 = 7, edge.label.cex = 1,
fade = FALSE, layout = "tree", style = "mx", nCharNodes = 0,
posCol = "#009e73ff", negCol = "#d55e00ff", edge.label.color = "black",
layoutSplit = TRUE, curve = 1, curvature = 1, #fixedStyle = 1,
exoCov = FALSE, rotation = 1)
What is missing in this plot are the \(R^2\) values for the endogenous variables. They can be represented either as the \(R^2\) itself, the variance explained for the endogenous variables, or as the error variances (\(\zeta\)), in raw or standardized units. Alternatively, if we wish to treat error variables like true causal influences, we can use path coefficients for their effects: the square roots of the standardized error variances (e.g., \(\sqrt{0.84} = 0.92\)), or equivalently \(\sqrt{1 - R^2}\) (James B. Grace 2021).
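For our pruned model, these error path coefficients can be computed directly from the R-Square values in the summary output (values copied from above, hence rounded):

```r
# Error path coefficients sqrt(1 - R^2) for the endogenous variables:
r2 <- c(mass.above = 0.337, rich = 0.220, even = 0.078)  # from summary(fit.prune)
round(sqrt(1 - r2), 2)
## mass.above       rich       even
##       0.81       0.88       0.96
```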
To present the results of a SEM, the following should be stated in the report (James B. Grace 2021):